Add spark350emr shim layer [EMR] #10202
Conversation
This PR adds a new shim layer, spark350emr, which supports running Spark RAPIDS on AWS EMR Spark 3.5.0. --------- Signed-off-by: Maomao Min <[email protected]>
This PR will add a new 350emr shim into branch-24.02; will this be included in our 24.02.0 release? I assume compiling and integration tests will be run on an EMR cluster. This would be a risk, as burndown is only one week away (2024-01-29) according to our release plan, and we may not have time to set up compile/integration-test CI jobs on an EMR cluster.
@@ -160,6 +160,8 @@ def test_subtraction_ansi_no_overflow(data_gen):
    _decimal_gen_38_10,
    _decimal_gen_38_neg10
    ], ids=idfn)
@pytest.mark.xfail(condition=is_spark_350emr(),
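For reference, the conditional-xfail pattern in this hunk looks like the following self-contained sketch; the helper body and the reason string below are placeholders, not the exact ones from this PR:

import pytest

def is_spark_350emr():
    # Placeholder standing in for the version predicate added by this PR; the
    # real helper checks whether the running Spark build is EMR's 3.5.0.
    return False

# Mark the test as an expected failure only on EMR Spark 3.5.0; elsewhere it runs normally.
@pytest.mark.xfail(condition=is_spark_350emr(),
                   reason='illustrative reason: EMR back-ported the relevant Spark fix')
def test_decimal_subtraction_example():
    assert 1 - 1 == 0  # stand-in body; the real test exercises decimal subtraction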
Thanks! I'll try to incorporate the mentioned fix.
aggregator/pom.xml
2024 copyrights are needed for this and many other files.
Will add.
<!-- -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>${slf4j.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-slf4j2-impl</artifactId>
    <version>${log4j.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <version>${log4j.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>${log4j.version}</version>
</dependency>
<dependency>
    <!-- API bridge between log4j 1 and 2 -->
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-1.2-api</artifactId>
    <version>${log4j.version}</version>
</dependency>
<!-- -->
Curious why all of these were added? Especially wondering about the log4j dependencies since we normally want to use slf4j APIs when logging instead of hitting log4j.
Normally we leave these sorts of dependencies out because we explicitly want to pull in whatever version Spark is using and compile against that. The risk of explicitly declaring them is that they might conflict with the version used by the particular Spark version we're compiling against, and instead of a compile-time problem we get a runtime problem.
I added this because I encountered errors related to log4j and slf4j, such as NoSuchMethod errors, when running unit tests. After adding these dependencies, the errors were gone. I agree that this approach might introduce risks. Are there any better approaches we can try?
That seems to imply there's an issue with the test classpath where the APIs that were available at compile-time aren't present at runtime. I'd have Maven dump the compile and test classpaths and see if you can find jars that are missing in test vs. compile that would explain it. We should be picking these up as transitive dependencies from the Spark jars. Maybe there's an issue with the EMR Spark dependencies where that's somehow not the case (but only at test runtime?!).
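For what it's worth, one way to do that comparison with stock maven-dependency-plugin goals (nothing below is specific to this repo; run it in the module whose tests fail) is:

# Show which slf4j/log4j versions arrive transitively from the Spark dependencies
mvn dependency:tree -Dincludes=org.slf4j,org.apache.logging.log4j

# Dump the compile and test classpaths to files so they can be diffed
mvn dependency:build-classpath -DincludeScope=compile -Dmdep.outputFile=compile-cp.txt
mvn dependency:build-classpath -DincludeScope=test -Dmdep.outputFile=test-cp.txt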
@@ -0,0 +1,53 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
Notice that the emr/pom.xml is a new file in this PR with NVIDIA copyrights but this file does not have the copyright -- intentional?
I missed the copyright. Will add it.
This file's parent directory is named spark350 instead of spark350emr.
Will fix.
@@ -0,0 +1,112 @@
/*
 * Copyright (c) 2019-2021, NVIDIA CORPORATION.
This appears to be a brand new file.
- * Copyright (c) 2019-2021, NVIDIA CORPORATION.
+ * Copyright (c) 2024, NVIDIA CORPORATION.
Will fix.
Co-authored-by: Navin Kumar <[email protected]>
For failing PR checks you need to run the command shown at line 72 in 645daa1.
Closing until we can retarget to the latest branch.
Summary
This PR adds a new shim layer spark350emr which supports running Spark RAPIDS on AWS EMR Spark 3.5.0.
Testing
Unit Testing
I ran the full suite of unit tests with an example command as below:
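A representative invocation, assuming the new shim is selected through the same -Dbuildver property as the existing shims (the exact flags used may have differed), is:

# Hypothetical example: build against the EMR Spark 3.5.0 shim and run its unit tests
mvn clean verify -Dbuildver=350emr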
and got the following results:
After some investigation and analysis, I found the following:
We started an email thread earlier to discuss these unit test failures, but we have not fixed them yet and need some help with them.
Integration Testing
I ran the full suite of integration tests and found some test failures that we've discussed before, as follows:
The failures occur because AWS EMR back-ported the fix for this issue into its Spark 3.5.0. Thus, I mark them as xfail in this change.
Reproduction
To reproduce my test results, you can use the following environment:
- Branch: branch-24.02.
- Spark version: 3.5.0-amzn-0. You can get it from an AWS EMR cluster with release emr-7.0.0.
Also, for your convenience, I've already attached this patch and scala-test-detailed-output.log here: unit_tests.zip
Others
For this change to build successfully, we also need the following change:
I think this is because there is no such private artifact for spark350emr. Do we need an additional change for this, or will NVIDIA take care of it?